On the dynamics of interdomain routing in the Internet
The routes used in the Internet's interdomain routing system are a rich
information source that could be exploited to answer a wide range of
questions. However, analyzing routes is difficult, because the fundamental
object of study is a set of paths. In this dissertation, we present new
analysis tools -- metrics and methods -- for analyzing paths, and apply them
to study interdomain routing in the Internet over long periods of time.
Our contributions are threefold. First, we build on an existing metric (Routing
State Distance) to define a new metric that allows us to measure the similarity
between two prefixes with respect to the state of the global routing system.
Applying this metric over time yields a measure of how the set of paths to each
prefix varies at a given timescale. Second, we present PathMiner, a system to
extract large scale routing events from background noise and identify the AS
(Autonomous System) or AS-link most likely responsible for the event. PathMiner
is distinguished from previous work in its ability to identify and analyze
large-scale events that may re-occur many times over long timescales. We show
that it is scalable, being able to extract significant events from multiple
years of routing data at a daily granularity. Finally, we equip Routing State
Distance with a new set of tools for identifying and characterizing
unusually routed ASes. At the micro level, we use our tools to identify clusters of ASes that have the most unusual routing at any given time. We also show
that analysis of individual ASes can expose business and engineering strategies
of the organizations owning the ASes. These strategies are often related to
content delivery or service replication. At the macro level, we show that the
set of ASes with the most unusual routing defines discernible and interpretable
phases of the Internet's evolution. Furthermore, we show that our tools can be
used to provide a quantitative measure of the "flattening" of the Internet.
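The core idea behind Routing State Distance can be illustrated concretely. Below is a minimal sketch, assuming each prefix is summarized by a table mapping vantage ASes to their next hops toward that prefix; the function name and table format are illustrative, not the dissertation's code:

```python
def routing_state_distance(next_hops_p, next_hops_q):
    """Basic RSD between prefixes p and q: the number of common
    vantage ASes whose next hop toward p differs from their next
    hop toward q. (Illustrative sketch of the underlying metric.)"""
    # Only compare vantage points that observe routes to both prefixes.
    shared = set(next_hops_p) & set(next_hops_q)
    return sum(1 for v in shared if next_hops_p[v] != next_hops_q[v])
```

Two prefixes with identical routing state have distance 0; tracking this quantity for the same prefix pair over time yields the kind of per-timescale variability measure described above.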
Evaluating LLP Methods: Challenges and Approaches
Learning from Label Proportions (LLP) is an established machine learning
problem with numerous real-world applications. In this setting, data items are
grouped into bags, and the goal is to learn individual item labels, knowing
only the features of the data and the proportions of labels in each bag.
Although LLP is a well-established problem, it has several unusual aspects that
create challenges for benchmarking learning methods. Fundamental complications
arise because of the existence of different LLP variants, i.e., dependence
structures that can exist between items, labels, and bags. Accordingly, the
first algorithmic challenge is the generation of variant-specific datasets
capturing the diversity of dependence structures and bag characteristics. The
second methodological challenge is model selection, i.e., hyperparameter
tuning; due to the nature of LLP, model selection cannot easily use the
standard machine learning paradigm. The final benchmarking challenge consists
of properly evaluating LLP solution methods across various LLP variants. We
note that there is very little consideration of these issues in prior work, and
there are no general solutions for these challenges proposed to date. To
address these challenges, we develop methods capable of generating LLP datasets
meeting the requirements of different variants. We use these methods to
generate a collection of datasets encompassing the spectrum of LLP problem
characteristics, which can be used in future evaluation studies. Additionally,
we develop guidelines for benchmarking LLP algorithms, including the model
selection and evaluation steps. Finally, we illustrate the new methods and
guidelines by performing an extensive benchmark of a set of well-known LLP
algorithms. We show that choosing the best algorithm depends critically on the
LLP variant and model selection method, demonstrating the need for our proposed
approach.
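To make the setting concrete, here is a minimal sketch of the simplest (i.i.d.) LLP variant, in which items are grouped into fixed-size bags and only bag-level label proportions are released to the learner; the function name and bagging scheme are illustrative, not the paper's actual dataset generators:

```python
import numpy as np

def make_llp_bags(X, y, bag_size, rng=None):
    """Partition items into fixed-size bags and release only the
    label proportion of each bag (i.i.d. variant sketch; other LLP
    variants impose dependence between items, labels, and bags)."""
    if rng is None:
        rng = np.random.default_rng(0)
    idx = rng.permutation(len(X))  # random assignment of items to bags
    bags, proportions = [], []
    for start in range(0, len(X), bag_size):
        members = idx[start:start + bag_size]
        bags.append(X[members])
        proportions.append(y[members].mean())  # fraction of positive labels
    return bags, proportions
```

A learner sees only `bags` and `proportions`, yet is evaluated on recovering the hidden item-level labels `y`.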
Tracking Knowledge Propagation Across Wikipedia Languages
In this paper, we present a dataset of inter-language knowledge propagation in Wikipedia. Covering all 309 language editions and 33M articles, the dataset aims to track the full propagation history of Wikipedia concepts and to enable follow-up research on building predictive models of that propagation. For this purpose, we align all Wikipedia articles in a language-agnostic manner according to the concept they cover, which results in 13M propagation instances. To the best of our knowledge, this dataset is the first to explore full inter-language propagation at a large scale. Together with the dataset, we provide a holistic overview of the propagation and key insights about the underlying structural factors to aid future research. For example, we find that although long cascades are unusual, propagation tends to continue further once it reaches more than four language editions. We also find that the size of a language edition is associated with the speed of propagation. We believe the dataset not only contributes to the prior literature on Wikipedia growth but also enables new use cases such as edit recommendation for addressing knowledge gaps, detection of disinformation, and cultural relationship analysis.
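The cascade statistics discussed above can be sketched as follows, assuming each propagation instance is reduced to per-language article creation timestamps for a single concept; the representation and function name are illustrative, not the dataset's actual schema:

```python
def propagation_cascade(creation_times):
    """Order the language editions covering one concept by article
    creation time. Cascade length is the number of editions reached;
    propagation speed is approximated here by the median delay between
    consecutive editions. (Illustrative sketch.)"""
    ordered = sorted(creation_times.items(), key=lambda kv: kv[1])
    delays = sorted(t2 - t1 for (_, t1), (_, t2) in zip(ordered, ordered[1:]))
    median_delay = delays[len(delays) // 2] if delays else None
    return [lang for lang, _ in ordered], median_delay
```

Aggregating such cascades across all 13M propagation instances would yield distributions of cascade length and inter-edition delay of the kind summarized in the abstract.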
An analysis of factors that influence interactions between Twitter users
In information networks where users send messages to one another, the issue of information overload naturally arises: which are the most important messages? In this work we study the problem of understanding the importance of messages in Twitter. We approach this problem in two stages. First, we perform an extensive characterization of a very large Twitter data set, which includes all users, social relations, and messages posted from the beginning of the service up to August 2009. We show evidence that information overload is present: users sometimes have to search through hundreds of messages to find those that are interesting to reply to or retweet. We then identify factors that influence user response or retweet probability: previous responses to the same tweeter, the tweeter's sending rate, the age of the tweet, and some basic text elements of the tweet. In our second stage, we show that some of these factors can be used to improve the ordering of tweets as presented to the user. By inspecting user activity over time, we construct a simple on-off model of user behavior that allows us to infer when a user is actively using Twitter. Then, we explore two methods from machine learning for ranking tweets: a Naive Bayes predictor and a Support Vector Machine classifier. We show that it is possible to reorder tweets to increase the fraction of replied or retweeted messages appearing in the first positions of the list by as much as 60%.
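As an illustration of the ranking stage, the sketch below hand-rolls a Bernoulli Naive Bayes over binary tweet features (e.g., "user previously replied to this tweeter") and scores tweets by the log-odds of a reply. It is a simplified stand-in under those assumptions, not the thesis's implementation:

```python
import math

def train_bernoulli_nb(features, replied):
    """Fit per-feature Bernoulli likelihoods with Laplace smoothing.
    features: binary feature vectors, one per tweet;
    replied: 1 if the user replied/retweeted the tweet. (Sketch.)"""
    n = len(replied)
    model = {}
    for c in (0, 1):
        rows = [f for f, y in zip(features, replied) if y == c]
        log_prior = math.log((len(rows) + 1) / (n + 2))
        # Smoothed probability that each feature is 1 within class c.
        probs = [(sum(r[j] for r in rows) + 1) / (len(rows) + 2)
                 for j in range(len(features[0]))]
        model[c] = (log_prior, probs)
    return model

def reply_score(model, f):
    """Log-odds that a tweet attracts a reply; rank tweets by this,
    highest first, to push likely-replied messages to the top."""
    scores = {}
    for c, (log_prior, probs) in model.items():
        s = log_prior
        for x, p in zip(f, probs):
            s += math.log(p if x else 1 - p)
        scores[c] = s
    return scores[1] - scores[0]
```

Sorting a user's incoming tweets by `reply_score` in descending order is the kind of reordering whose effect the simulation studies measure.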
Politics and disinformation: Analyzing the use of Telegram's information disorder network in Brazil for political mobilization
Over the past few years, as network communication supported by social platforms and messengers has increasingly displaced traditional mass communication, political campaigns have come to rely on new tools and methods, including the use of these structures to promote an environment of information disorder for the purpose of mobilization. This work followed the use of Telegram as a tool for political mobilization in Brazil, collecting data from a dense information disorder network used to mobilize voters in support of then-president Jair Bolsonaro on 7 September (Independence Day in Brazil) of 2021 and 2022. The results showed that engagement declined, mainly due to the lack of support from certain groups, such as anti-vaccination advocates and the truck drivers' movement. There was also a decrease in the extremism of discussion themes and lower user activity levels.